Morpho-syntactic ambiguity and tagset design for Hungarian

نویسندگان

  • Tamás Váradi
  • Csaba Oravecz
چکیده

The paper reports on work in progress to develop a tag set for Hungarian. The rich morphological structure of the language makes tagging feasible only after a full-scale morphological analysis, which results in a magnitude of patterns that do not easily translate into a corpus tag set of manageable size. The paper analyses the extent and types of morpho-syntactic ambiguity found in a 21m word sample of the Hungarian National Corpus as a preparatory stage in establishing a practical tag set.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Principled Hidden Tagset Design for Tiered Tagging of Hungarian

For highly inflectional languages, the number of morpho-syntactic descriptions (MSD), required to descriptionally cover the content of a word-form lexicon, tends to rise quite rapidly, approaching a thousand or even more set of distinct codes. For the purpose of automatic disambiguation of arbitrary written texts, using such large tagsets would raise very many problems, starting from implementa...

متن کامل

An Overview of Data-Driven Part-of-Speech Tagging

Over the last twenty years or so, the approaches to partof-speech tagging based on machine learning techniques have been developed or ported to provide high-accuracy morpho-lexical annotation for an increasing number of languages. Given the large number of morpho-lexical descriptors for a morphologically complex language, one has to consider ways to avoid the data sparseness threat in standard ...

متن کامل

Using a Large Set of EAGLES-compliant Morpho-syntactic Descriptors as a Tagset for Probabilistic Tagging

The paper presents one way of reconciling data sparseness with the requirement of high accuracy tagging in terms of fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements and therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagset...

متن کامل

Using a Large Set of EAGLES-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging

The paper presents one way of reconciling data sparseness with the requirement of high accuracy tagging in terms of fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements and therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagset...

متن کامل

High Accuracy Tagging with Large Tagsets

The paper presents experiments and results related to morpho-syntactic (MS) tagging of a highly inflectional language, based on combining language models (LM) learnt from multiple register-diversified corpora. To cope with a large tagset (614 tags), our underlying tagger uses a hidden smaller tagset (92 tags), mapped back, after the proper tagging, into the initial tagset. The same text is tagg...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007